LoFTK: a framework for fully automated calculation of predicted Loss-of-Function variants and genes

Background: Loss-of-Function (LoF) variants in human genes are important because of their impact on clinical phenotypes and their frequent occurrence in the genomes of healthy individuals. The association of LoF variants with complex diseases and traits may lead to the discovery and validation of novel therapeutic targets. Current approaches predict high-confidence LoF variants without identifying the specific genes or the number of copies they affect. Moreover, there is a lack of methods for detecting knockout genes caused by compound heterozygous (CH) LoF variants.

Results: We have developed the Loss-of-Function ToolKit (LoFTK), which allows efficient and automated prediction of LoF variants from genotyped, imputed and sequenced genomes. LoFTK enables the identification of genes that are inactive in one or two copies and provides summary statistics for downstream analyses. LoFTK can identify CH LoF variants, which result in LoF genes with two copies lost. Using data from parents and offspring, we show that 96% of CH LoF genes predicted by LoFTK in the offspring have the respective alleles donated by each parent.

Conclusions: LoFTK is a command-line tool that provides a reliable computational workflow for predicting LoF variants from genotyped and sequenced genomes and for identifying genes that are inactive in one or two copies. LoFTK is open software and is freely available to non-commercial users at https://github.com/CirculatoryHealth/LoFTK.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13040-023-00321-5.


Supplementary text
Determining the optimal imputation quality threshold
LoFTK was developed to analyze any genetic data, such as (imputed) genotypes and (exome) sequencing data. Imputed genotype data provide two quality metrics: the INFO score and the imputed allele probability. These metrics can be used to filter imputation results per individual variant. We determined the optimal imputation quality metrics in order to extract only the most genuine LoF variants, using whole exome sequencing (WES) data from the UK Biobank (UKBB) as the gold standard. We extracted WES and genotype data of 4,476 randomly selected UKBB participants. Genotypes were phased and imputed using SHAPEIT2 and IMPUTE2, respectively [1][2][3]. Phasing and imputation were performed using a combined reference panel from the 1000 Genomes Project phase 3 [4] and the Genome of the Netherlands (GoNL) study [5]. The imputed genotypes were submitted to LoFTK using a range of INFO scores and imputed allele probabilities, yielding a different number of LoF variants for each subset. Because the UKBB provides unphased WES data, we phased the exomes using SHAPEIT2 and then converted them to VCF. We retrieved the variants that overlap between the phased exomes and the imputed genotype data (Supplementary Figure 1).
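The subsetting described above can be illustrated with a minimal Python sketch. It is not LoFTK code: the record type, the function names, the variant ID format and the toy values are all assumptions made for illustration. It keeps imputed variants whose INFO score exceeds a threshold and then restricts them to variants that are also present in the exome data.

```python
from typing import NamedTuple


class ImputedVariant(NamedTuple):
    variant_id: str  # hypothetical ID format "chrom:pos:ref:alt"
    info: float      # imputation INFO score


def filter_by_info(variants, min_info):
    """Keep variants whose INFO score is strictly above min_info."""
    return [v for v in variants if v.info > min_info]


def overlap_with_wes(imputed, wes_ids):
    """Keep imputed variants that are also present in the exome data."""
    return [v for v in imputed if v.variant_id in wes_ids]


# Toy data (invented for illustration only).
imputed = [
    ImputedVariant("1:1000:A:T", 0.95),
    ImputedVariant("1:2000:G:C", 0.55),
    ImputedVariant("2:3000:C:T", 0.35),
    ImputedVariant("2:4000:T:G", 0.20),
]
wes_ids = {"1:1000:A:T", "1:2000:G:C", "2:4000:T:G"}

# One subset per INFO threshold, restricted to variants shared with WES.
subsets = {
    threshold: overlap_with_wes(filter_by_info(imputed, threshold), wes_ids)
    for threshold in (0.3, 0.6, 0.9)
}
for threshold, kept in subsets.items():
    print(threshold, [v.variant_id for v in kept])
```

Stricter thresholds retain fewer variants, which is why each subset passed to the downstream comparison contains a different number of candidate LoF variants.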
In the imputed genotype data, we analyzed LoF variants in three datasets: the first with INFO > 0.3, the second with INFO > 0.6 and the third with INFO > 0.9. We compared the presence of each LoF variant between each imputed genotype dataset and WES, taking into account the imputed allele probabilities (0.01 to 0.1) for that variant (Supplementary Figure 1). For each individual, each LoF variant from the three imputed datasets was matched to WES in order to count false negatives (the average number of LoF variants found in WES data but not in imputed data) and false positives (the average number of LoF variants found in imputed data but not in WES data) (Supplementary Table 1). The imputed dataset with INFO > 0.9 gives the best prediction of true LoF variants, because it has fewer false-positive 2-copy LoF variants (approximately three variants) than the other two datasets (INFO > 0.3 and INFO > 0.6). However, selecting an optimal imputed allele probability was difficult because the results showed no apparent variation across the tested probabilities.
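The false-positive and false-negative counts defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce Supplementary Table 1: the function name, the dictionary layout (sample ID mapped to the set of LoF variant IDs that sample carries) and the toy data are hypothetical.

```python
def average_fp_fn(imputed_calls, wes_calls):
    """Return (mean false positives, mean false negatives) per individual.

    Both arguments map a sample ID to the set of LoF variant IDs carried by
    that sample. WES is treated as the gold standard.
    """
    samples = sorted(set(imputed_calls) & set(wes_calls))
    if not samples:
        raise ValueError("no samples shared between the two call sets")
    # False positive: called in the imputed data but absent from WES.
    fp = sum(len(imputed_calls[s] - wes_calls[s]) for s in samples)
    # False negative: present in WES but missing from the imputed data.
    fn = sum(len(wes_calls[s] - imputed_calls[s]) for s in samples)
    return fp / len(samples), fn / len(samples)


# Toy data (invented for illustration only).
imputed_calls = {"s1": {"a", "b"}, "s2": {"c"}}
wes_calls = {"s1": {"a"}, "s2": {"c", "d", "e"}}

mean_fp, mean_fn = average_fp_fn(imputed_calls, wes_calls)
print(mean_fp, mean_fn)  # 0.5 1.0
```

Running the same function once per INFO threshold gives one pair of averages per imputed dataset, which is the form of comparison the paragraph above describes.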

Compound heterozygous LoF variants
Compound heterozygous (CH) variants occur when each parent transmits a single LoF allele to the proband at distinct loci within the same gene [6]. LoFTK can annotate CH LoF variants, which leave both copies of a gene inactive. We used trio-family data from the Genome of the Netherlands (GoNL) study (Illumina Immunochip microarray SNP data) [5] to validate the transmission of real CH LoF variants from parents to probands. We included 760 samples from the GoNL cohort, all of which belong to family-based trios. We excluded variants with a call rate ≤ 0.99, variants with a Hardy-Weinberg equilibrium (HWE) P ≤ 0.001 and monomorphic variants. After filtering, we used the TOPMed imputation server to impute missing genotypes [7]. As a post-imputation quality control step, we excluded variants with an imputation quality score < 0.3 (INFO < 0.3) and monomorphic variants.
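The CH definition above, and the trio check it implies, can be made concrete with a short Python sketch. It is a simplified illustration and not LoFTK's implementation: the function names, the phased-genotype layout and the toy data are assumptions, and de novo variants and phasing errors are ignored.

```python
def compound_het_genes(phased):
    """Find genes carrying LoF alleles on both haplotypes at distinct loci.

    phased maps gene -> {variant_id: (hap1, hap2)}, where 1 marks the LoF
    allele. Returns gene -> (variants on haplotype 1, variants on haplotype 2).
    """
    result = {}
    for gene, variants in phased.items():
        on_hap1 = {v for v, gt in variants.items() if gt == (1, 0)}
        on_hap2 = {v for v, gt in variants.items() if gt == (0, 1)}
        if on_hap1 and on_hap2:  # one LoF allele on each gene copy
            result[gene] = (on_hap1, on_hap2)
    return result


def inherited_from_both_parents(hap_sets, father_carries, mother_carries):
    """True if one haplotype's LoF alleles are carried by the father and the
    other haplotype's by the mother (in either order)."""
    h1, h2 = hap_sets
    return (h1 <= father_carries and h2 <= mother_carries) or (
        h1 <= mother_carries and h2 <= father_carries
    )


# Toy proband (invented): GENE_A is CH; GENE_B has both LoF alleles on one copy.
proband = {
    "GENE_A": {"v1": (1, 0), "v2": (0, 1)},
    "GENE_B": {"v3": (1, 0), "v4": (1, 0)},
}
ch = compound_het_genes(proband)
print(sorted(ch))  # ['GENE_A']
print(inherited_from_both_parents(ch["GENE_A"], {"v1"}, {"v2"}))  # True
```

GENE_B is not reported because both of its LoF alleles sit on the same haplotype, so one copy of the gene remains intact. This is why phased genotypes are needed to call CH LoF genes.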
We utilized LoFTK to predict LoF variants and genes in the imputed genotypes of 250 families.
We discovered 250 CH LoF variants generating 2-copy LoF genes in 164 probands. There were